Combinatorial and Probabilistic Approaches to Motif Recognition
Abstract
Short substrings of genomic data that are responsible for biological processes, such as gene expression, are referred to as motifs. Motifs with the same function may not match exactly, due to mutation events at a few of the motif positions. Allowing for non-exact occurrences significantly complicates their discovery. Given a number of DNA strings, the motif recognition problem is the task of detecting motif instances in every given sequence without knowledge of the position of the instances or the pattern shared by these substrings. We describe a novel approach to motif recognition, and provide theoretical and experimental results that demonstrate its efficiency and accuracy. Our algorithm, MCL-WMR, builds an edge-weighted graph model of the given motif recognition problem and uses a graph clustering algorithm to quickly determine important subgraphs that need to be searched further for valid motifs. By considering a weighted graph model, we narrow the search dramatically to smaller problems that can be solved with significantly less computation. The Closest String problem is a subproblem of motif recognition, and it is NP-hard. We give a linear-time algorithm for a restricted version of the Closest String problem, and an efficient polynomial-time heuristic that solves the general problem with high probability. We initiate the study of the smoothed complexity of the Closest String problem, which in turn explains our empirical results demonstrating the effectiveness of our probabilistic heuristic. Central to this analysis is the introduction of a perturbation model of Closest String instances, within which we provide a probabilistic analysis of our algorithm. The smoothed analysis suggests reasons why a well-known fixed-parameter tractable algorithm solves Closest String instances extremely efficiently in practice.
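To make the Closest String problem concrete, the following is a minimal sketch of its standard formulation: given equal-length strings, find a center string that minimizes the maximum Hamming distance to any input. It is an exhaustive search for illustration only, not the linear-time algorithm or the heuristic described in the abstract; the function names and the DNA alphabet default are assumptions introduced here.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings, alphabet="ACGT"):
    """Brute-force Closest String (illustrative, exponential in string length).

    Tries every candidate over `alphabet` and keeps the one whose largest
    Hamming distance to the inputs (its radius) is smallest.
    """
    length = len(strings[0])
    best, best_radius = None, length + 1
    for cand in product(alphabet, repeat=length):
        radius = max(hamming(cand, s) for s in strings)
        if radius < best_radius:
            best, best_radius = "".join(cand), radius
    return best, best_radius


center, radius = closest_string(["ACGT", "ACGA", "ACCT"])
print(center, radius)
```

The search space grows as the alphabet size raised to the string length, which is why exhaustive search is only usable on toy inputs and why the NP-hardness noted above matters in practice.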
Although the Closest String model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the Closest String with Outliers problem, to overcome this limitation. A systematic parameterized complexity analysis accompanies the introduction of this problem, providing a surprising insight into the sensitivity of this problem to slightly different parameterizations. Through the application of probabilistic and combinatorial insights into the Closest String problem, we develop sMCL-WMR, a program that is much faster than its predecessor MCL-WMR. We apply and adapt sMCL-WMR and MCL-WMR to analyze the promoter regions of the canola seed-coat. Our results identify important regions of the canola genome that are responsible for specific biological activities. This knowledge may be used toward the long-term aim of developing crop varieties with specific biological characteristics, such as disease resistance.
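The effect of outliers can be illustrated with a small sketch of the Closest String with Outliers idea: a center string only has to be within distance d of all but at most k of the inputs. This is an exhaustive search written for illustration under the usual formulation of the problem; it is not the method used in sMCL-WMR, and the function names, parameter names, and alphabet default are assumptions.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_string_with_outliers(strings, d, k, alphabet="ACGT"):
    """Brute-force search (illustrative, exponential in string length).

    Returns (center, inliers) for the first candidate that lies within
    Hamming distance d of at least len(strings) - k inputs, or None if
    no such candidate exists.
    """
    length = len(strings[0])
    for cand in product(alphabet, repeat=length):
        inliers = [s for s in strings if hamming(cand, s) <= d]
        if len(inliers) >= len(strings) - k:
            return "".join(cand), inliers
    return None


data = ["ACGT", "ACGA", "ACCT", "TTTT"]
print(closest_string_with_outliers(data, d=1, k=0))  # no center covers all four
print(closest_string_with_outliers(data, d=1, k=1))  # one string may be discarded
```

With k = 0 the single divergent string makes the instance infeasible at d = 1, while allowing one outlier recovers a center for the remaining three strings.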
Similar resources
Identification of Transcription Factor Binding Sites in Promoter Regions by Modularity Analysis of the Motif Co-occurrence Graph
Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since their output is easier to interpret. Combinatorial methods have the potential to identify hard-to-detect motifs, but...
Combinatorial Algorithms for Approximate Words
The search for and analysis of motifs in the genome and the proteome is a very active domain in computational biology. The so-called formal approaches search for exceptional words on an entire genome, or some part of it. The specificity of our algorithmic approach is the combination of recent results in probability and combinatorics with high-level pattern matching algorithms. In this paper, we ...
6.895 Project: Combinatorial Regulatory Motif Finding
In this class, we have seen many algorithms for regulatory motif finding. In general, these algorithms fall into two categories: probabilistic and combinatorial. Probabilistic solutions, such as expectation-maximization and Gibbs sampling, have had much empirical success, but they are not well understood from a theoretical standpoint. For combinatorial solutions, either the algorithms use exhau...
Persian Handwritten Digit Recognition Using Particle Swarm Probabilistic Neural Network
Handwritten digit recognition can be categorized as a classification problem. The Probabilistic Neural Network (PNN) is one of the most effective and useful classifiers, and it works based on the Bayesian rule. In this paper, in order to recognize Persian (Farsi) handwritten digits, a combination of an intelligent clustering method and PNN has been utilized. Hoda database, which includes 80000 P...
Selection of Candidate Regions in Object Detection and Recognition Systems
According to studies carried out in recent years, determining region proposals is one of the crucial steps in detecting and recognizing the objects in an image. In fact, this step has acted as a bottleneck, consuming significant computational effort. As a result, selection of suitable and fast approaches, under these circumstances, may enhance the pe...
Publication date: 2010